CRM-20565 - Improve ajax dedupe lookups on contact add form #10341

colemanw · 2017-05-12T17:38:33Z

CRM-20565: Better & more configurable dedupe lookups when adding a contact

eileenmcnaughton · 2017-05-12T21:02:46Z

I'm going to try to take a look at this next week

colemanw · 2017-05-12T21:38:45Z

colemanw · 2017-05-13T16:20:57Z

CRM/Admin/Form/Preferences/Display.php

+    $this->addRadio('contact_ajax_check_similar', ts('Check for Similar Contacts'), array(
+      '1' => ts('While Typing'),
+      '0' => ts('When Saving'),
+      '2' => ts('Never'),


Note: In all existing installs this preference is already set to either '1' or '0'. These new options keep the original meaning of 1 and 0 so no upgrade script is needed.

colemanw · 2017-06-03T00:54:20Z

Thoughts on this one @eileenmcnaughton ?

eileenmcnaughton · 2017-06-12T22:01:45Z

Sorry - I still have this on my radar - I'm expecting to do significant work on dedupes going into next quarter - with extracting to an extension & making all interaction with the dedupeFinder & merger via api being high priorities

colemanw · 2017-10-15T20:16:28Z

@eileenmcnaughton I've rebased this. Can you take a look?

eileenmcnaughton · 2017-10-19T05:25:00Z

I've started looking at this & in general I agree with the change. I'm still working through it so these are not final comments.

I note with the api call we need to pass 'check_permissions' to the contact.get call - perhaps it will filter out some info, or not, but it's the right thing to do.
I'm not a fan of how that function does different things depending on return. I'm kinda inclined to deprecate it rather than hack it. Then we would wind up with 2 functions eventually

Contact.getmatches
Contact.getduplicates

The former is this function - matches for the data we have. The second would be pairs of duplicates to potentially merge (like the merge screen uses but less nasty). (The second obviously does not exist as yet.).

I want to look a bit more closely at a tidy up of the admin form for this.
From a user point of view we have the situation where if I have the default settings I no longer get presented with duplicates until I have entered first name, last name AND email. This is a sort of regression from getting them as soon as I've entered last name.

In general the defaults for Supervised & Unsupervised seem to be the reverse of what they should be. ie. Unsupervised is happy to given anyone the bash if there is a matching email whereas Supervised is conservative about putting up options. I'm not quite sure how to deal with this but as a user I feel like I would want matches as soon as I've entered first name in & then the top matches should get increasingly specific as I provide more data.

@lcdservices I think you understand the thinking behind Supervised vs Unsupervised - which I'm grappling with.

colemanw · 2017-10-19T16:06:11Z

You're right. I've added it.
Actually it returns the same format. If you leave off return then it sends back:

[
12 => ['id' => 12],
34 => ['id' => 34],
]

and if you add e.g. return => ['display_name', 'email']` then it sends back:

[
  12 => ['id' => 12, 'display_name' => 'Joe Tester', 'email' => 'joe@test.er'],
  34 => ['id' => 34, 'display_name' => 'Jim Tester', 'email' => 'jim@test.er'],
]

Ok. Note my inline comment in the PR about that.
Well the code in this PR doesn't do that, it starts looking for matches as soon as you enter something in any field. But the supervised dedupe rule might get in your way depending on how you've set it up. A more lax rule will indeed give more results. What would be really nice would be to sort those matches by weight (e.g. closer matches first) but our BAO doesn't currently do that I don't think :(

eileenmcnaughton · 2017-10-19T20:48:41Z

Regarding 4 - I'm just working off what people have as the default out-of-the-box settings. I realise people can customise it but most don't. The experience would be that they are used to having it resolve when they enter last name & now it won't - so it would feel like a bug/regression to a lot of people.

eileenmcnaughton · 2017-10-19T20:49:10Z

(your reply to 2 makes sense & I agree that is fine)

colemanw · 2017-10-20T00:38:39Z

I just tested and you're right about number 4. It's not very useful that it only works when all 3 fields are an exact match. I honestly don't know how that qualifies as a "supervised" rule as anyone with exactly the same first&last name and email is unquestionably the same person.

That said, what we had in place prior to this PR wasn't exactly gold-standard either. It matched on last name and nothing else, so Jan Smith would match Bob Smith, etc. - also not very useful results, and not at all configurable as it is now.

eileenmcnaughton · 2017-10-20T02:08:14Z

My instinct is that the Supervised & Unsupervised rules are messed up & were implemented in reverse of how they were conceived - the reason I was hoping to get input from @lcdservices

I feel like this WOULD feel like a regression to some users as is & I feel like we should consider adding a third rule type that exists by default called 'InputAssistance' or something & making it match on last name or email being sufficient.

Overall I feel like this might not be universally better & we should probably solicit wider input. Possibly from the dev list although it's more a UI question than a dev question but our only other list seems to be 'partners' which is definitely not right and please not a double post....

From a code point of view I'm happy with the code apart from wanting to dig a bit further into the admin form side of things and agree it is an improvement. I just realised I haven't tested that we still do awful ghastly things to people when they choose ' on save ' for the setting

lcdservices · 2017-10-20T02:46:52Z

@eileenmcnaughton sorry -- missed the earlier ping on this. I'm happy to give my 2 cents on the rules... ;-)

I've never liked that the unsupervised rule was so loose by default (email only). I argued for stricter default rules many versions ago but Dave G felt that since email is the only field required on contrib forms and event reg forms, and the inclusion of profiles is optional, we shouldn't create a default dedupe rule that may be invalidated by the default configuration of contrib pages/events. While I appreciate that argument, I would argue we should opt for a more strict rule regardless, as the unsupervised rule should always err on the side of being more restrictive (to prevent false positives). Further -- since that discussion several years ago we've added a validation rule so that if the user selects a dedupe rule where the fields are not present in the form via the profile, we throw a warning that the configuration is not capable of creating a match.
In principle, I agree with @eileenmcnaughton that the two rules are better switched. Unsupervised should match on more values, Supervised on less.
I also agree that the supervised rule would benefit from being more nuanced -- which I gather is the impetus for this PR. personally I think a match on first name + last name is a sufficient default rule.
the two default "optimized" rules would benefit from some review. those were created in conjunction with the creation of the dedupe hook, (v4.2?). we had that hook added for NYSS and proposed the optimized rules at that time (that's when the discussion about defaults took place). but I feel like they might benefit from a code review. (I realize that's not the purview of this issue)

lcdservices · 2017-10-20T02:50:47Z

One more comment after reading the JIra issue --
I agree that the lookup should be done via ajax or postprocess validate, but not both -- and I'd vote for the AJAX method. in my thinking, the ajax method should derive from the supervised rule, and should only be triggered when all fields in that rule actually have data in them. so if the rule is defined as first + last name, it should not be triggered after the first name is added, but only after both first and last have been added.

eileenmcnaughton · 2017-10-20T08:00:46Z

So to attempt to pull things together - I think both @lcdservices & I would be happy to see only TWO options

ajax checks
no check

I think you'd struggle to find anyone who likes the on-save validation - but we could ask.

We don't think the default rules are right (which would affect usablity of this) but the question is - how do you deal with that. The best thing I can think of is to add a new type that is 'fit-for-purpose' by default. I feel like last name OR email match would be a good default for a data-entry rule (although there might be performance issues with that default on larger sites).

colemanw · 2017-10-23T15:58:51Z

I agree about a made-for-deduping rule. I actually custom coded one of those for Woolman years ago. It did all kinds of fuzzy OR logic, taking phone numbers, nick names, email, etc into consideration. It wasn't very performant but it was awesome at nailing hard-to-find dupes
https://github.com/woolman/drupal-modules/blob/master/woolman_website/woolman_dupe_query.inc#L54.

eileenmcnaughton · 2017-11-16T02:58:55Z

I had another go at this and the usability seemed ok if I created a new for-purpose rule & used that instead of the 'Supervised' rule (in practice I did this by editing the Supervised rule to have a threshold of 5 & deleting CRM/Dedupe/BAO/QueryBuilder/IndividualSupervised.php). Using the existing built in rule didn't give me any feedback until all 3 fields matched which was a worse user experience than without the patch.

With my 'new custom rule' the performance was 'ok' even on the wmf database. By that I mean that on many names it resolved in real time but even on 'John Smith' it didn't cause queries to spin off and bring the database down. I did get a very strange UI experience on the slow ones as nothing showed up & then suddenly I had 3 matching boxes of matches.

However, the problem with the code as is is that the names are sorted by Sort name. Since there are more than 20 Eileen's in our database the list of Eileen's displayed once I entered first_name did not include me. Once I entered my email it fired again - but presented the list of names in the same order so I was still only to see the first 20 possible matches which include me.

I tried to make it sort by weight so the best matches would float to the top - the code is my efforts, but while I managed to get the api to sort by weight the UI still did not.

Doing data entry the whole thing would work better if the email address field were the first one you hit because then it could try to match on that first & also I think it's realistically more important for dataentry than fields like Job Title & Contact Type - tangental...

diff --git a/CRM/Dedupe/BAO/RuleGroup.php b/CRM/Dedupe/BAO/RuleGroup.php
index 7557af3..f374863 100644
--- a/CRM/Dedupe/BAO/RuleGroup.php
+++ b/CRM/Dedupe/BAO/RuleGroup.php
@@ -211,8 +211,7 @@ class CRM_Dedupe_BAO_RuleGroup extends CRM_Dedupe_DAO_RuleGroup {
     $exclWeightSum = array();
 
     // create temp table
-    $dao = new CRM_Core_DAO();
-    $dao->query($tempTableQuery);
+    $dao = CRM_Core_DAO::executeQuery($tempTableQuery);
 
     CRM_Utils_Hook::dupeQuery($this, 'table', $tableQueries);
 
Discard this hunk from worktree [y,n,q,a,d,/,j,J,g,e,?]? q

wmf1411:civicrm emcnaughton$ git diff
diff --git a/CRM/Dedupe/BAO/RuleGroup.php b/CRM/Dedupe/BAO/RuleGroup.php
index 7557af3..f374863 100644
--- a/CRM/Dedupe/BAO/RuleGroup.php
+++ b/CRM/Dedupe/BAO/RuleGroup.php
@@ -211,8 +211,7 @@ class CRM_Dedupe_BAO_RuleGroup extends CRM_Dedupe_DAO_RuleGroup {
     $exclWeightSum = array();
 
     // create temp table
-    $dao = new CRM_Core_DAO();
-    $dao->query($tempTableQuery);
+    $dao = CRM_Core_DAO::executeQuery($tempTableQuery);
 
     CRM_Utils_Hook::dupeQuery($this, 'table', $tableQueries);
 
@@ -384,7 +383,8 @@ class CRM_Dedupe_BAO_RuleGroup extends CRM_Dedupe_DAO_RuleGroup {
       $query = "SELECT dedupe.id1 as id
                 FROM dedupe JOIN civicrm_contact ON dedupe.id1 = civicrm_contact.id {$this->_aclFrom}
                 WHERE contact_type = '{$this->contact_type}' {$this->_aclWhere}
-                AND weight >= {$this->threshold}";
+                AND weight >= {$this->threshold}
+                 ORDER BY weight DESC ";
     }
     else {
       $this->_aclWhere = ' AND c1.is_deleted = 0 AND c2.is_deleted = 0';
@@ -399,7 +399,8 @@ class CRM_Dedupe_BAO_RuleGroup extends CRM_Dedupe_DAO_RuleGroup {
                        LEFT JOIN civicrm_dedupe_exception exc ON dedupe.id1 = exc.contact_id1 AND dedupe.id2 = exc.contact_id2
                 WHERE c1.contact_type = '{$this->contact_type}' AND
                       c2.contact_type = '{$this->contact_type}' {$this->_aclWhere}
-                      AND weight >= {$this->threshold} AND exc.contact_id1 IS NULL";
+                      AND weight >= {$this->threshold} AND exc.contact_id1 IS NULL
+                      ORDER BY weight DESC ";
     }
 
     CRM_Utils_Hook::dupeQuery($this, 'threshold', $query);
diff --git a/CRM/Dedupe/Finder.php b/CRM/Dedupe/Finder.php
index fd5d1c9..108fec4 100644
--- a/CRM/Dedupe/Finder.php
+++ b/CRM/Dedupe/Finder.php
@@ -148,8 +148,8 @@ class CRM_Dedupe_Finder {
 
     $rgBao->params = $params;
     $rgBao->fillTable();
-    $dao = new CRM_Core_DAO();
-    $dao->query($rgBao->thresholdQuery($params['check_permission']));
+    $dao = CRM_Core_DAO::executeQuery($rgBao->thresholdQuery($params['check_permission']));
+
     $dupes = array();
     while ($dao->fetch()) {
       if (isset($dao->id) && $dao->id) {
diff --git a/api/v3/Contact.php b/api/v3/Contact.php
index 38a1172..86846e0 100644
--- a/api/v3/Contact.php
+++ b/api/v3/Contact.php
@@ -1371,13 +1371,15 @@ function civicrm_api3_contact_duplicatecheck($params) {
   );
   $values = array();
   if ($dupes && !empty($params['return'])) {
-    return civicrm_api3('Contact', 'get', array(
+    $result = civicrm_api3('Contact', 'get', array(
       'return' => $params['return'],
       'id' => array('IN' => $dupes),
       'options' => CRM_Utils_Array::value('options', $params),
-      'sequential' => CRM_Utils_Array::value('sequential', $params),
       'check_permissions' => CRM_Utils_Array::value('check_permissions', $params),
     ));
+    
+    $result['values'] = array_replace(array_flip($dupes), $result['values']);
+    return $result;
   }
   foreach ($dupes as $dupe) {
     $values[$dupe] = array('id' => $dupe);
diff --git a/templates/CRM/Contact/Form/Contact.tpl b/templates/CRM/Contact/Form/Contact.tpl
index 384fe48..cdc4b9a 100644
--- a/templates/CRM/Contact/Form/Contact.tpl
+++ b/templates/CRM/Contact/Form/Contact.tpl
@@ -300,7 +300,6 @@
       CRM.api3('contact', 'duplicatecheck', {
         match: match,
         rule_type: 'Supervised',
-        options: {sort: 'sort_name'},
         return: ['display_name', 'email']
       }).done(function(data) {
         var title = data.count == 1 ? {/literal}"{ts escape='js'}Similar Contact Found{/ts}" : "{ts escape='js'}Similar Contacts Found{/ts}"{literal},

eileenmcnaughton · 2018-05-10T06:39:26Z

@colemanw what should we do with this? I think it seems like to get it mergeable we would need to

create a new inbuilt rule ('date entry'?) that mimicked existing behaviour - people could edit from there.
figure out how to order by weight per the last comment

I think it's a nice improvement - but it also seems not quite mergeable at the moment & I don't think either of us are quite committed enough to get it over the line at the moment? If that is the case we should close the PR & track it from gitlab (we won't lose access to the changes - although if you ALSO delete the branch & we haven't linked into the commit specifically we will struggle to find it again

colemanw · 2018-05-15T17:55:51Z

The trouble with adding a new rule is that we bump against the limitations of the rule categories. Currently we have "Supervised" "Unsupervised" and "General". We could create a new category but that's yet another can-o-worms to open.

IMO we ought to create a new hard-coded rule called "Name or Email or Phone or Address" and make it the default Supervised rule. The current default Supervised rule isn't very good. For existing sites, if they have the current default "Name and Email" default Supervised rule, we switch it. If they've configured a custom Supervised rule, we leave it alone.

That could be done in a separate PR. Just to think out loud, I think the ideal fuzzy rule would do something like

IF (
  (first_name && last_name match) OR
  (nick_name && last_name match) OR
  (email matches) OR
  (phone matches) OR
  (street_address matches)
)

@eileenmcnaughton you know more about optimizing queries on huge databases than I do. Is the above too crazy for large databases to handle?

eileenmcnaughton · 2018-05-15T21:02:45Z

OK - so to answer that I think I need to feel more comfortable with how the existing Supervised rule is used - ie. audit the places. I realise my discomfort with the idea is about the places it is used. TODO audit this.

My inclination is that adding a DataEntry category is safer because it involves changing behaviour in less places. Where it differs logically from the 'normal' concept of 'Supervised' is that we would want it to be more aggressive about suggestions - ie. as you type 'Coleman' suggestions for 'Coleman' would appear (for a large DB they might tweak that but on a small DB that's a good rule).

Regarding whether it should be a hardcoded rule - I can do some tests - I'm not sure that the hard-coded rules are more efficient from my previous tests in this instance.

eileenmcnaughton · 2018-05-29T23:08:47Z

@colemanw I had an idea here.

On one side of this patch we have a significant UI improvement on the other we have a change in behaviour to the selection of which contacts are duplicates that makes me uncomfortable. Could we just get the first part commited - ie. write a (deprecated) api or ajax call that does the same look up as right now & use that & then review (if we choose) switching the mechanism?

colemanw · 2018-05-30T03:05:40Z

@eileenmcnaughton well the current implementation of the ajax lookup leaves a lot to be desired. It matches on last_name and nothing else.
I suppose now that the api3 supports the OR operator I could fire off a Contact.get api call to match on first name or last name or email, without going through the dedupe engine; at least that would be an improvement.

eileenmcnaughton · 2018-05-30T03:51:53Z

@colemanw Yep & we could get that merged & then consider improving the rules.

Per discussion on civicrm#10341 Since the default dedupe rules are inadequate for this, we'll just use the contact api for now.

colemanw · 2018-05-31T03:02:51Z

@eileenmcnaughton done. I also added some throttling to prevent multiple messages popping up if the user fills out the form quickly.

eileenmcnaughton · 2018-06-05T02:56:59Z

test this please

eileenmcnaughton · 2018-06-05T03:00:28Z

I just tested this & it feels better from a UI POV & I didn't spot any issues - I think it's good to merge

monishdeb · 2018-06-05T08:54:24Z

Merging as per tag.

monishdeb · 2018-06-05T08:55:15Z

@colemanw @eileenmcnaughton please close the JIRA ticket. Thanks!

civicrm-builder added the master label May 12, 2017

colemanw force-pushed the CRM-20565 branch from 528a74b to e50e50f Compare May 12, 2017 19:38

colemanw commented May 13, 2017

View reviewed changes

colemanw force-pushed the CRM-20565 branch from e50e50f to d506e9c Compare October 15, 2017 20:15

colemanw added 3 commits May 30, 2018 21:21

CRM-20565 - Improve ajax dedupe lookups on contact add form

d6def51

CRM-20565 - Add throttling to prevent multiple dupe checks popping up

6733671

CRM-20565 - Use contact api for now

01ee39a

Per discussion on civicrm#10341 Since the default dedupe rules are inadequate for this, we'll just use the contact api for now.

colemanw force-pushed the CRM-20565 branch from 6128f69 to 01ee39a Compare May 31, 2018 03:02

eileenmcnaughton added the merge on pass label Jun 5, 2018

monishdeb merged commit c47a8ed into civicrm:master Jun 5, 2018

eileenmcnaughton deleted the CRM-20565 branch June 5, 2018 08:56

darrick mentioned this pull request Jun 17, 2022

dev/core#3440 Fix issue with hardcoded dedupe rules fields used when adding contacts. #23353

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CRM-20565 - Improve ajax dedupe lookups on contact add form #10341

CRM-20565 - Improve ajax dedupe lookups on contact add form #10341

colemanw commented May 12, 2017 •

edited by civicrm-builder

eileenmcnaughton commented May 12, 2017

colemanw commented May 12, 2017

colemanw May 13, 2017 •

edited

colemanw commented Jun 3, 2017

eileenmcnaughton commented Jun 12, 2017

colemanw commented Oct 15, 2017

eileenmcnaughton commented Oct 19, 2017

colemanw commented Oct 19, 2017

eileenmcnaughton commented Oct 19, 2017

eileenmcnaughton commented Oct 19, 2017

colemanw commented Oct 20, 2017

eileenmcnaughton commented Oct 20, 2017

lcdservices commented Oct 20, 2017 •

edited

lcdservices commented Oct 20, 2017

eileenmcnaughton commented Oct 20, 2017

colemanw commented Oct 23, 2017 •

edited

eileenmcnaughton commented Nov 16, 2017

eileenmcnaughton commented May 10, 2018

colemanw commented May 15, 2018 •

edited

eileenmcnaughton commented May 15, 2018

eileenmcnaughton commented May 29, 2018

colemanw commented May 30, 2018

eileenmcnaughton commented May 30, 2018

colemanw commented May 31, 2018

eileenmcnaughton commented Jun 5, 2018

eileenmcnaughton commented Jun 5, 2018

monishdeb commented Jun 5, 2018

monishdeb commented Jun 5, 2018

CRM-20565 - Improve ajax dedupe lookups on contact add form #10341

CRM-20565 - Improve ajax dedupe lookups on contact add form #10341

Conversation

colemanw commented May 12, 2017 • edited by civicrm-builder

eileenmcnaughton commented May 12, 2017

colemanw commented May 12, 2017

colemanw May 13, 2017 • edited

Choose a reason for hiding this comment

colemanw commented Jun 3, 2017

eileenmcnaughton commented Jun 12, 2017

colemanw commented Oct 15, 2017

eileenmcnaughton commented Oct 19, 2017

colemanw commented Oct 19, 2017

eileenmcnaughton commented Oct 19, 2017

eileenmcnaughton commented Oct 19, 2017

colemanw commented Oct 20, 2017

eileenmcnaughton commented Oct 20, 2017

lcdservices commented Oct 20, 2017 • edited

lcdservices commented Oct 20, 2017

eileenmcnaughton commented Oct 20, 2017

colemanw commented Oct 23, 2017 • edited

eileenmcnaughton commented Nov 16, 2017

eileenmcnaughton commented May 10, 2018

colemanw commented May 15, 2018 • edited

eileenmcnaughton commented May 15, 2018

eileenmcnaughton commented May 29, 2018

colemanw commented May 30, 2018

eileenmcnaughton commented May 30, 2018

colemanw commented May 31, 2018

eileenmcnaughton commented Jun 5, 2018

eileenmcnaughton commented Jun 5, 2018

monishdeb commented Jun 5, 2018

monishdeb commented Jun 5, 2018

colemanw commented May 12, 2017 •

edited by civicrm-builder

colemanw May 13, 2017 •

edited

lcdservices commented Oct 20, 2017 •

edited

colemanw commented Oct 23, 2017 •

edited

colemanw commented May 15, 2018 •

edited